GH-50186: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB by lriggs · Pull Request #50187 · apache/arrow

lriggs · 2026-06-16T01:32:50Z

Rationale for this change

Gandiva's REPLACE hardcodes a 65535-byte output buffer, throwing Buffer overflow for output string whenever the result exceeds 64 KB. The cap is arbitrary: Gandiva's variable-length output column already grows dynamically and is only bounded by the int32 offset width (~2 GB). Real queries that replace into large concatenated/aggregated strings fail unnecessarily.

What changes are included in this PR

replace_utf8_utf8_utf8 now sizes the output buffer to the exact result instead of using a fixed cap. The output length of a replace is deterministic:

out_len = text_len + num_matches * (to_str_len - from_str_len)
The wrapper does a single counting pass over the input to find the number of non-overlapping matches of from_str (mirroring the match loop already used in the implementation), computes the exact size in gdv_int64 to avoid intermediate overflow, and passes that as max_length.

The internal replace_with_max_len_utf8_utf8_utf8 is unchanged — its bounds checks now act purely as a correctness backstop (they should never fire with an exact bound), and its explicit-max-length signature remains for the existing unit tests.
When to is shorter than from, the result shrinks and max_length <= text_len, so the shrinking path is sized correctly too.

Are these changes tested?

Yes. Added regression cases to TestStringOps.TestReplace in string_ops_test.cc:

A 35000-char 'X' input with X → XY, producing a 70000-byte result (previously overflowed at 65535) — asserts no error and exact length/content.
A 70000-char shrinking case (XX → X) to cover the shrink path on a >64 KB input.
Full precompiled suite passes locally (132/132), including the existing explicit-max_len overflow tests, which call the internal function directly and are unaffected.

Are there any user-facing changes?

REPLACE now succeeds on results larger than 64 KB instead of erroring. No API or signature changes.

GitHub Issue: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB #50186

github-actions · 2026-06-16T01:33:16Z

⚠️ GitHub issue #50186 has been automatically assigned in GitHub to PR creator.

Gandiva's REPLACE hardcoded a 65535-byte output cap, throwing "Buffer overflow for output string" whenever the result exceeded 64 KB. Size the output buffer to the exact result instead, by counting non-overlapping matches of from_str: text_len + num_matches * (to_str_len - from_str_len). Removes the arbitrary cap; the internal replace_with_max_len variant and its bounds checks are unchanged. Gandiva variable-length output uses int32 offsets, so a single output string cannot exceed INT_MAX (2 GB). Guard that boundary explicitly with a clear error message instead of letting the int32 size cast wrap silently (which could otherwise lead to under-allocation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dmitry-chirkov-dremio · 2026-06-16T19:41:25Z

-                                             from_str_len, to_str, to_str_len, 65535,
-                                             out_len);
+                                             from_str_len, to_str, to_str_len,
+                                             static_cast<gdv_int32>(max_length), out_len);


Do we even need this argument in the function signature?

Yes, thats how the function know how much memory to allocate for the output string.

dmitry-chirkov-dremio · 2026-06-16T19:44:16Z

  EXPECT_EQ(std::string(out_str, out_len), "TestString");
  EXPECT_FALSE(ctx.has_error());

+  // Large output (>64 KB) must not overflow: buffer is sized to the exact result.


I'd feel more comfortable with few specific edge cases but I would not block the PR. Examples:

INT_MAX resulting size

0 resulting size

Tested one byte past the limit (INT_MAX + 1) rather than exactly INT_MAX, because a result of exactly INT_MAX is allowed and would force a real ~2 GB arena allocation — not something to do in a unit test. The boundary test confirms the guard triggers at precisely the right point.

dmitry-chirkov-dremio · 2026-06-16T19:46:44Z

                                   gdv_int32 to_str_len, gdv_int32* out_len) {
+  // Count non-overlapping matches to size the output buffer exactly, so large
+  // results are not capped by an arbitrary limit.
+  gdv_int64 num_matches = 0;


We just reviewed base64 encoding few weeks back and didn't accept double pass variant. See if you can reduce the logic back to single pass or provide benchmarks for single-vs-double variants (0/few/lots replacements) as perhaps resizing is more costly.

old vs new is measured without a separate build: the old wrapper was literally replace_with_max_len_utf8_utf8_utf8(…, 65535, …), and that inner function is unchanged. So new = replace_utf8_utf8_utf8 (counting scan + exact alloc) and old = replace_with_max_len_utf8_utf8_utf8 given an adequately-sized buffer (the pre-change algorithm, no scan). The delta is purely the O(n) scan.

Results

case text_len matches new ns/call old ns/call delta

small/dense expand a->ab 256 256 3013.5 2250.3

small/sparse expand a->ab 256 4 1738.6 922.1

medium/dense expand a->ab 65536 65536 753132.4 573520.9

medium/sparse expand a->ab 65536 1024 436435.6 224018.2

large/dense expand a->ab 4194304 4194304 52924267.0 40737654.2

large/sparse expand a->ab 4194304 65536 28408644.3 14630748.5

large/dense shrink ab->a 4194304 0 12771361.1 12659342.5

Interpretation:

The scan's relative cost is consistent across sizes (~30% dense, ~90% sparse), as expected for an O(n) pass — it scales with input length, not match count.
Sparse is the worst case (+90%): with few matches, the old replace pass is mostly cheap byte-copying, so adding a full second comparison pass nearly doubles the work. Dense (+30%): the replace pass already does expansion-copying, so the scan is a smaller fraction.
Shrink path: +0.9% — confirms the refinement works; when to ≤ from the scan is skipped and there's no measurable overhead.

github-actions Bot added Component: C++ Component: Gandiva awaiting review Awaiting review labels Jun 16, 2026

lriggs force-pushed the replaceLimit branch from 2cc9cb1 to e2b7c99 Compare June 16, 2026 19:24

dmitry-chirkov-dremio reviewed Jun 18, 2026

View reviewed changes

github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 18, 2026

More tests

6d6bf39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-50186: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB #50187

GH-50186: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB #50187
lriggs wants to merge 2 commits into
apache:mainfrom
lriggs:replaceLimit

lriggs commented Jun 16, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

dmitry-chirkov-dremio Jun 16, 2026

Uh oh!

lriggs Jun 18, 2026

Uh oh!

dmitry-chirkov-dremio Jun 16, 2026

Uh oh!

lriggs Jun 18, 2026

Uh oh!

dmitry-chirkov-dremio Jun 16, 2026

Uh oh!

lriggs Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

case	text_len	matches	new ns/call	old ns/call	delta
small/dense	expand a->ab	256	256	3013.5	2250.3
small/sparse	expand a->ab	256	4	1738.6	922.1
medium/dense	expand a->ab	65536	65536	753132.4	573520.9
medium/sparse	expand a->ab	65536	1024	436435.6	224018.2
large/dense	expand a->ab	4194304	4194304	52924267.0	40737654.2
large/sparse expand	a->ab	4194304	65536	28408644.3	14630748.5
large/dense	shrink ab->a	4194304	0	12771361.1	12659342.5

Conversation

lriggs commented Jun 16, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

dmitry-chirkov-dremio Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

lriggs Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

dmitry-chirkov-dremio Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

lriggs Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

dmitry-chirkov-dremio Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

lriggs Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lriggs commented Jun 16, 2026 •

edited by github-actions Bot

Loading